loading in the data:
library(curl)
## Using libcurl 8.10.1 with Schannel
f <- curl("https://raw.githubusercontent.com/clairezng/clairezh-AN588-Replication/main/DATASET_CARDOSO&OTTONI.csv")
d <- read.csv(f, header = TRUE, stringsAsFactors = FALSE)
head(d) # looks fine
## ID POPULATION SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 1 TOR SCNP MALE ADULT 53 7715 145.57 0
## 2 BEI SCNP MALE ADULT 35 9974 284.97 1
## 3 ZAN SCNP MALE ADULT 63 10663 168.78 5
## 4 NIC SCNP MALE ADULT 42 11793 280.79 0
## 5 ZEN SCNP MALE ADULT 38 6513 171.39 2
## 6 CLA SCNP MALE ADULT 27 5570 206.30 0
## PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 1 489 457 70
## 2 656 616 52
## 3 655 632 91
## 4 797 735 89
## 5 548 518 36
## 6 342 330 29
authors stated that the SCNP capuchins visited the boxes at a mean = 213 visits/day; SD = 55; total individual visits = 1067. the FBV capuchins visited the boxes at a mean = 29 visits/day; SD = 12; total individual visits = 376
partitioning the two populations into separate datasets
SCNP <- subset(d, POPULATION == "SCNP")
FBV <- subset(d, POPULATION == "FBV")
SCNP
## ID POPULATION SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 1 TOR SCNP MALE ADULT 53 7715 145.57 0
## 2 BEI SCNP MALE ADULT 35 9974 284.97 1
## 3 ZAN SCNP MALE ADULT 63 10663 168.78 5
## 4 NIC SCNP MALE ADULT 42 11793 280.79 0
## 5 ZEN SCNP MALE ADULT 38 6513 171.39 2
## 6 CLA SCNP MALE ADULT 27 5570 206.30 0
## 7 BLP SCNP MALE SUB 34 9010 265.00 0
## 8 CAP SCNP MALE JUVENILE 47 9504 202.21 2
## 9 LIM SCNP MALE JUVENILE 27 6219 230.33 11
## 10 COR SCNP MALE JUVENILE 53 9659 182.25 7
## 11 VOL SCNP MALE JUVENILE 64 9201 143.77 10
## 12 PAD SCNP MALE JUVENILE 59 8175 138.56 10
## 13 DES SCNP MALE JUVENILE 66 7281 110.32 13
## 14 CIN SCNP MALE JUVENILE 77 7043 91.47 0
## 15 GOR SCNP FEMALE ADULT 56 7040 125.71 376
## 16 MAC SCNP FEMALE ADULT 69 8730 126.52 268
## 17 BEM SCNP FEMALE ADULT 58 7850 135.34 598
## 18 CAN SCNP FEMALE ADULT 27 4804 177.93 430
## 19 NIN SCNP FEMALE ADULT 17 977 57.47 7
## 20 LIC SCNP FEMALE ADULT 30 3514 117.13 36
## 21 ALI SCNP FEMALE ADULT 42 5131 122.17 318
## 22 VES SCNP FEMALE ADULT 16 2373 148.31 93
## 23 BAT SCNP FEMALE JUVENILE 67 7944 118.57 292
## PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 1 489 457 70
## 2 656 616 52
## 3 655 632 91
## 4 797 735 89
## 5 548 518 36
## 6 342 330 29
## 7 635 601 31
## 8 784 766 78
## 9 278 258 42
## 10 805 726 103
## 11 262 156 25
## 12 150 37 44
## 13 22 4 12
## 14 0 0 0
## 15 0 NA NA
## 16 0 NA NA
## 17 0 NA NA
## 18 0 NA NA
## 19 0 NA NA
## 20 0 NA NA
## 21 0 NA NA
## 22 0 NA NA
## 23 0 NA NA
FBV # looks okay!
## ID POPULATION SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 24 TEI FBV MALE ADULT 20 1162 58.10 13
## 25 MSN FBV MALE ADULT 10 625 62.50 0
## 26 JTB FBV MALE ADULT 38 1647 43.34 5
## 27 CAT FBV MALE SUB 48 3169 66.02 5
## 28 COC FBV MALE JUVENILE 38 3475 91.45 74
## 29 CNG FBV MALE JUVENILE 11 757 68.82 7
## 30 PAT FBV MALE JUVENILE 24 2507 104.46 32
## 31 DIT FBV FEMALE ADULT 59 3012 51.05 2
## 32 CHU FBV FEMALE ADULT 50 4336 86.72 6
## 33 PSS FBV FEMALE ADULT 18 2601 129.17 10
## 34 AMR FBV FEMALE ADULT 7 827 118.14 0
## 35 DOR FBV FEMALE JUVENILE 7 215 30.71 0
## 36 PAM FBV FEMALE JUVENILE 20 2325 130.05 13
## 37 PAS FBV FEMALE JUVENILE 18 948 52.67 0
## PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 24 0 NA NA
## 25 0 NA NA
## 26 0 NA NA
## 27 0 NA NA
## 28 0 NA NA
## 29 0 NA NA
## 30 0 NA NA
## 31 0 NA NA
## 32 0 NA NA
## 33 0 NA NA
## 34 0 NA NA
## 35 0 NA NA
## 36 0 NA NA
## 37 0 NA NA
calculating mean and SD number of visits for the SCNP population:
sum(SCNP$NUMVISIT)
## [1] 1067
mean(SCNP$NUMVISIT)
## [1] 46.3913
sd(SCNP$NUMVISIT)
## [1] 17.70141
the total number of visits adds up fine. maybe they made a typo? I don’t understand how the dataset I have could possibly turn out a mean of 213 visits/days of presentation.
looking at the FBV population:
sum(FBV$NUMVISIT)
## [1] 368
mean(FBV$NUMVISIT)
## [1] 26.28571
sd(FBV$NUMVISIT)
## [1] 17.19315
okay. well the total doesn’t even add up correctly, much less the mean and standard deviation hello? what am I doing wrong, because there’s no way THEIR data is just crazy incorrect.
trying the “Mann-Whitney test” they do. They say the Z = -4541, p < 0.0001. I’m honestly not even sure what they’re calculating the statistics of, but my best guess is that it’s comparing the frequency of probe use in the SCNP group vs. the FBV population. also, I’m assuming that the Mann-Whitney test is an unpaired two-sample Wilcoxon test
mw.probes <- wilcox.test(SCNP$PROBE.EVENT, FBV$PROBE.EVENT, paired = FALSE)
## Warning in wilcox.test.default(SCNP$PROBE.EVENT, FBV$PROBE.EVENT, paired =
## FALSE): cannot compute exact p-value with ties
mw.probes
##
## Wilcoxon rank sum test with continuity correction
##
## data: SCNP$PROBE.EVENT and FBV$PROBE.EVENT
## W = 252, p-value = 0.0008902
## alternative hypothesis: true location shift is not equal to 0
this doesn’t give me enough information, except that the p-value isn’t correct, so maybe they ran the Mann-Whitney test on something else?
in the table, they list values for length of the direct engagement with the task and the mean time of visits, so I’ll try it for that too:
mw.time <- wilcox.test(SCNP$TIME.VISIT, FBV$TIME.VISIT, paired = FALSE)
mw.time # p-value = 2.987e-07, which could definitely be the p < 0.0001 value they're talking about
##
## Wilcoxon rank sum exact test
##
## data: SCNP$TIME.VISIT and FBV$TIME.VISIT
## W = 306, p-value = 2.987e-07
## alternative hypothesis: true location shift is not equal to 0
mw.meantime <- wilcox.test(SCNP$Mean_TIMEVISIT, FBV$Mean_TIMEVISIT, paired = FALSE)
mw.meantime # okay, this spits out a p-value = 5.623e-06, which could also be the p < 0.0001 value they list
##
## Wilcoxon rank sum exact test
##
## data: SCNP$Mean_TIMEVISIT and FBV$Mean_TIMEVISIT
## W = 294, p-value = 5.623e-06
## alternative hypothesis: true location shift is not equal to 0
these definitely seem more promising. now I need to figure out the Z-score of -4541, which seems absurd? I wonder if the “Z” is supposed to represent something else. Also, in order to assume a normal distribution, both \(n_{1}\) (the SCNP group) and \(n_{2}\) (the FBV group) would have to be more than 20, which they aren’t.
Nevertheless, we have to convert the W values the Wilcoxon Rank Sum test spits out into Mann-Whitney U values.
R tells me that R’s value can also be computed as the number of all pairs (x[i], y[j]) for which y[j] is not greater than x[i], the most common definition of the Mann-Whitney test. To ME, this seems like the W value is also the value of U.
Running into another problem here: 306 and 294 seem like the larger of the two U values, so I’ll reorder my variables
mw.time2 <- wilcox.test(FBV$TIME.VISIT, SCNP$TIME.VISIT, paired = FALSE)
mw.time2
##
## Wilcoxon rank sum exact test
##
## data: FBV$TIME.VISIT and SCNP$TIME.VISIT
## W = 16, p-value = 2.987e-07
## alternative hypothesis: true location shift is not equal to 0
mw.meantime2 <- wilcox.test(FBV$Mean_TIMEVISIT, SCNP$Mean_TIMEVISIT, paired = FALSE)
mw.meantime2
##
## Wilcoxon rank sum exact test
##
## data: FBV$Mean_TIMEVISIT and SCNP$Mean_TIMEVISIT
## W = 28, p-value = 5.623e-06
## alternative hypothesis: true location shift is not equal to 0
there we go! U = 16 when comparing total direct interaction time, and U = 28 when comparing mean time per visit.
in order to get a Z-value, we must assume U is approximately normally distributed. so let’s just say it is, even though the sample sizes fall a little bit short.
\[z = \frac{U - \mu_{U}}{\sigma_{U}}\]
where the mean of U \(\mu_{U} = \frac{n_{1}n_{2}}{2}\) and the standard deviation of U \(\sigma_{U} = \sqrt{\frac{n_{1}n_{2}(n_{1}+n_{2}+1)}{12}}\)
I’m going to write a function to calculate the Z-score at least, because I need to run two different values (and may have to deal with more later:
note that I tried to use the qnorm() function to find the Z-score. just trust me that it did not work.
i’m also going to use a continuity correction of 0.5, because the M-W test is categorical and my sample size is small. lol
mw.z <- function(U, n1, n2, correction = TRUE) {
mean_U <- (n1*n2) / 2
sd_U <- sqrt((n1*n2*(n1+n2+1))/12)
correction <- if (correction) 0.5 else 0
z <- (U - mean_U - correction) / sd_U
print(z)
}
mw.z(16, length(FBV$TIME.VISIT), length(SCNP$TIME.VISIT))
## [1] -4.556526
mw.z(28, length(FBV$Mean_TIMEVISIT), length(SCNP$Mean_TIMEVISIT))
## [1] -4.18073
these are terribly close (minus the decimal point). maybe i’ll take out the continuity correction:
mw.z(16, length(FBV$TIME.VISIT), length(SCNP$TIME.VISIT), correction = FALSE)
## [1] -4.540868
mw.z(28, length(FBV$Mean_TIMEVISIT), length(SCNP$Mean_TIMEVISIT), correction = FALSE)
## [1] -4.165072
OH MY GOD the Z = -4.540868 is what we’re looking for because if you round it to the thousandth it’s -4.541 which is the number they state. except (obviously) their value is missing a decimal point but maybe that’s just a typo. THANK GOD. okay so they’re looking at the difference between the length of direct interaction(s) between the two groups. holyyyy
moving on to recreating figure one and figure two. they are pretty simple bar graphs.
library(ggplot2)
names(SCNP)
## [1] "ID" "POPULATION" "SEX"
## [4] "GROUP.AGE" "NUMVISIT" "TIME.VISIT"
## [7] "Mean_TIMEVISIT" "FINGERS" "PROBE.EVENT"
## [10] "SUCCESSFUL" "NUMBER.OF.PROBE.TOOLS"
SCNP_males <- subset(SCNP, SEX == "MALE")
SCNP_males <- SCNP_males[SCNP_males$ID !="CIN", ]
SCNP_males # checking this
## ID POPULATION SEX GROUP.AGE NUMVISIT TIME.VISIT Mean_TIMEVISIT FINGERS
## 1 TOR SCNP MALE ADULT 53 7715 145.57 0
## 2 BEI SCNP MALE ADULT 35 9974 284.97 1
## 3 ZAN SCNP MALE ADULT 63 10663 168.78 5
## 4 NIC SCNP MALE ADULT 42 11793 280.79 0
## 5 ZEN SCNP MALE ADULT 38 6513 171.39 2
## 6 CLA SCNP MALE ADULT 27 5570 206.30 0
## 7 BLP SCNP MALE SUB 34 9010 265.00 0
## 8 CAP SCNP MALE JUVENILE 47 9504 202.21 2
## 9 LIM SCNP MALE JUVENILE 27 6219 230.33 11
## 10 COR SCNP MALE JUVENILE 53 9659 182.25 7
## 11 VOL SCNP MALE JUVENILE 64 9201 143.77 10
## 12 PAD SCNP MALE JUVENILE 59 8175 138.56 10
## 13 DES SCNP MALE JUVENILE 66 7281 110.32 13
## PROBE.EVENT SUCCESSFUL NUMBER.OF.PROBE.TOOLS
## 1 489 457 70
## 2 656 616 52
## 3 655 632 91
## 4 797 735 89
## 5 548 518 36
## 6 342 330 29
## 7 635 601 31
## 8 784 766 78
## 9 278 258 42
## 10 805 726 103
## 11 262 156 25
## 12 150 37 44
## 13 22 4 12
I’m also going to reorder individuals because I want to match the order they’re listed in within the paper.
SCNP_males$ID <- factor(SCNP_males$ID, levels = c("TOR", "BEI", "ZAN", "NIC", "ZEN", "CLA", "BLP", "CAP", "LIM", "COR", "VOL", "PAD", "DES"))
# plotting!
ggplot(data=SCNP_males, aes(x=ID, y=NUMBER.OF.PROBE.TOOLS)) +
geom_bar(stat="identity", fill="red", width = 0.5) +
labs(x = element_blank(), y = "no. probe tools", caption = "Figure 1. Number of sticks used by each monkey of the SCNP group.") +
theme(plot.caption = element_text(hjust = 0)) +
scale_y_continuous(breaks = seq(0, 100, by = 10), expand=expansion(mult=c(0,0.01)))
and here is the original Figure 1 from the paper:
this makes the fact that the figures are inaccurate MUCH more noticeable - the researchers’ graphs definitely aren’t to scale, and it seems to outright plot some values incorrectly?
Figure 2 replication:
# first, calculating proportions of successful probing:
SCNP_males$PROBE.SUCCESS <- SCNP_males$SUCCESSFUL/SCNP_males$PROBE.EVENT
SCNP_males$PROBE.SUCCESS # looks fine
## [1] 0.9345603 0.9390244 0.9648855 0.9222083 0.9452555 0.9649123 0.9464567
## [8] 0.9770408 0.9280576 0.9018634 0.5954198 0.2466667 0.1818182
# plotting:
ggplot(data=SCNP_males, aes(x=ID, y=PROBE.SUCCESS)) +
geom_bar(stat="identity", fill="navy", width = 0.5) +
labs(x = element_blank(), y = "proportion of successful probing", caption = "Figure 2. Proportions of successful probing by each monkey of the SCNP group.") +
theme(plot.caption = element_text(hjust = 0)) +
scale_y_continuous(breaks = seq(0, 1.0, by = 0.2), expand = expansion(mult = c(0,0.05)))
and here is the original Figure 2:
My y-axis intervals aren’t entirely accurate, but I can’t comprehend why the original figure lists 0, 0.3, 0.5, 0.8, and 1.